Large scale proteomic studies create novel privacy considerations

Privacy protection is a core principle of genomic but not proteomic research. We identified independent single nucleotide polymorphism (SNP) protein quantitative trait loci (pQTL) from COPDGene and Jackson Heart Study (JHS), calculated continuous protein level genotype probabilities, and then applied a naïve Bayesian approach to link SomaScan 1.3K proteomes to genomes for 2812 independent subjects from COPDGene, JHS, SubPopulations and InteRmediate Outcome Measures In COPD Study (SPIROMICS) and Multi-Ethnic Study of Atherosclerosis (MESA). We linked 90–95% of proteomes to their correct genome, and for 95–99% the correct genome was among the 1% most likely links. The linking accuracy in subjects with African ancestry was lower (~ 60%) unless training included diverse subjects. With larger profiling (SomaScan 5K) in the Atherosclerosis Risk in Communities (ARIC) study, correct identification was > 99% even in mixed ancestry populations. We also linked proteomes to proteomes and used the proteome alone to determine features such as sex, ancestry, and first-degree relatives. When serial proteomes are available, the linking algorithm can be used to identify and correct mislabeled samples. This work also demonstrates the importance of including diverse populations in omics research and that large proteomic datasets (> 1000 proteins) can be accurately linked to a specific genome through pQTL knowledge and should not be considered unidentifiable.

Non-discrimination Act of 2008) as well as privacy protection efforts such as the Global Alliance for Genomics and Health, which has created frameworks to ensure responsible and secure sharing of genomic and health-related data. A key feature of these policies in the United States is that they explicitly addressed genomic (single nucleotide, sequence, transcriptome, epigenomic, and gene expression) data only. Despite these policies, there have been multiple instances of "deidentified" personal information linked back to individual genetic profiles 4 , including well publicized individuals such as Henrietta Lacks 5 . There have also been methods proposed which can link expression data to genotype through eQTLs 6 .
Although lagging behind genotype and sequencing advances by 5-10 years, exponential technological advances in high throughput proteomics are leading to the creation of similar large databases with sensitive personal information. Concurrently there are studies which demonstrate that many proteins 7,8 have genetic quantitative trait loci (QTLs), but current practice is to consider these datasets as deidentified data. In this manuscript we show that even limited proteome profiles without peptide sequencing can be linked to specific individuals by using prior independent knowledge of these QTLs and we provide a bioinformatic solution which obfuscates reidentification, yet still preserves at least some biomarker-phenotype relationships. These findings suggest an immediate need to change policy regarding non-genomic data used for research or commercial use.

Methods
Study populations. All study participants provided written informed consent approved by institutional review boards (IRBs). COPDGene and Jackson Heart Study (JHS) cohorts were randomly split into training and testing datasets and training subjects were not included in the testing cohort. Other independent cohorts used for testing included Subpopulations and Intermediate Outcome Measures in COPD Study (SPIROMICS) and Multi-Ethnic Study of Atherosclerosis (MESA). Race was self-reported. Characteristics of subjects used for training and test are shown below with summary demographics in Table 1. This manuscript was approved by the publication committees of the cohorts listed below as well as the NHLBI Trans-Omics for Precision Medicine (TOPMed). All research was performed in accordance with relevant guidelines/regulations and informed consent was obtained from all participants and/or their legal guardians. Research involving human research participants was performed in accordance with the Declaration of Helsinki.
COPDGene. The NIH-sponsored multicenter Genetic Epidemiology of COPD study (COPDGene; ClinicalTrials.gov Identifier: NCT01969344) enrolled 10,263 non-Hispanic white (NHW) and Black (AA) individuals from January 2008 until April 2011 (Phase 1) who were aged 45-80 with ≥ 10 pack-year smoking history and no exacerbations for > 30 days; 457 age- and gender-matched healthy individuals with no history of smoking were enrolled as controls 9 . Subjects were genotyped using an Illumina HumanOmni Express 10 . 1184 subjects from the enrollment visit (P1) participated in an ancillary study in which they provided p100 (BD) fresh frozen plasma used for SomaScan 1.3K proteomic profiling, which measured 1305 proteins. An additional 547 independent subjects, who only had SomaScan profiling at a 5-year follow up visit (P2) and were not used in the training dataset, were used as an independent testing cohort. 5292 subjects also had SomaScan 5K (v4.0) proteomes using plasma from a P2 visit and were randomly split into training and testing to assess whether scaling improved identification accuracy. COPDGene has been approved by the BRANY IRB.
2 (1990-1992). Genotyping was performed using Affymetrix 6.0 array and imputed using TOPMed Freeze 5b datasets. Details of genotyping and imputation quality control methods were previously described 14 .
Proteome profiling. Proteomic profiles for 1305 proteins were generated using SomaScan v 1.3K (SomaLogic, Boulder, Colorado). The SomaScan 1.3K assay is described further in 15 . Normalization followed SomaLogic's guidelines for data processing, which encompass three sequential levels of normalization: Hybridization Control Normalization (Hyb), followed by Median Signal Normalization (Hyb.MedNorm) and Interplate Calibration (Hyb.MedNorm.Cal). There are no missing data on the platform. SomaScan 5K v4.0 (4776 proteins) was performed by SomaLogic and we used Adaptive Normalization by Maximum Likelihood (anmlSMP).
For pQTL discovery, we used a rank-based inverse normal transformation to align protein levels to a normal distribution; however, for estimating genotype probabilities and associations with smoking, we used log transformed protein values.

Statistical analyses. pQTL discovery by protein wide association study (pWAS). COPDGene had genotyping for 691,764 SNPs without imputation. Genotypes for these SNPs in JHS were called using TOPMed whole genome sequence. Only SNPs with minor allele frequencies (MAF) greater than 5% in the sample population were included for analysis. Both datasets were aligned to GRCh38. SNP-by-protein associations were assessed separately in the COPDGene and JHS discovery cohorts using linear regression assuming an additive model by genotype. Analysis was performed using the R package 'MatrixEQTL' (version 2.2) 16 . Each model assessed direct association between protein level and genotype, with no adjustment for covariates. Protein quantitative trait loci (pQTLs) were considered significant at FDR corrected p-value < 0.05. After merging the two sets of pQTLs from the two training cohorts, we reduced the set to obtain a list of uniquely associated protein and SNP combinations. For each unique protein in the pQTL set, we kept only the most significant pQTL SNP as determined by the p-value in the training cohorts (Fig. 1). When the two training cohorts had different top SNPs (often in linkage disequilibrium), we chose the SNP from the cohort with the lowest p-value. This first-level reduction produces a set of unique proteins, but in some cases, multiple proteins may be associated with the same SNP. If a SNP was associated with multiple proteins, we used only the protein with the strongest association for that SNP. This process ensured that each protein and each SNP appear only once in our pQTL sets.
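The two-step reduction described above can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `reduce_pqtls`, the tuple layout, and the use of the lowest p-value as the measure of "strongest association" in the second step are assumptions made for illustration.

```python
def reduce_pqtls(associations):
    """Reduce merged pQTL hits so each protein and each SNP appears once.

    associations: iterable of (protein, snp, p_value) tuples pooled from
    all training cohorts (hypothetical input layout).
    """
    # Step 1: for every protein keep the SNP with the lowest p-value,
    # regardless of which training cohort it came from.
    best_for_protein = {}
    for protein, snp, p in associations:
        if protein not in best_for_protein or p < best_for_protein[protein][1]:
            best_for_protein[protein] = (snp, p)

    # Step 2: if several proteins share a SNP, keep only the protein
    # with the lowest p-value for that SNP (assumed tie-break rule).
    best_for_snp = {}
    for protein, (snp, p) in best_for_protein.items():
        if snp not in best_for_snp or p < best_for_snp[snp][1]:
            best_for_snp[snp] = (protein, p)

    pairs = [(protein, snp, p) for snp, (protein, p) in best_for_snp.items()]
    # Rank pairs from most to least significant.
    return sorted(pairs, key=lambda pair: pair[2])
```

Sorting the surviving pairs by p-value makes it straightforward to select the top N protein/SNP pairs for the matching step.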
Bayesian modeling. For predicting the probability of genome matching we use a naïve Bayesian method (Fig. 2), which estimates the probability of observing genotype vector g using the genotype-specific mean (µ) and standard deviation (σ) estimated from training data. This is similar to an approach used in genotype estimation from eQTLs 6 . To combine the training estimates from COPDGene and JHS we used the GaussianNB model from scikit-learn (version 0.23.2) for this estimation 6 . During training, we use the partial_fit method to calculate the µ and σ parameters on a single dataset. The same method can be used to update µ and σ, allowing us to train a model on multiple datasets by sharing the trained model. Since each SNP is biallelic, we calculate three probabilities corresponding to the three possible genotypes using a Gaussian naïve Bayes framework, where we define three normal probability distribution functions

$$P(x \mid g) = \frac{1}{\sigma_g \sqrt{2\pi}} \exp\left(-\frac{(x - \mu_g)^2}{2\sigma_g^2}\right)$$

which describe the distribution of protein levels for each of the three genotypes (Fig. 3a), where μ_g and σ_g² are the estimated mean and variance, respectively, of the protein levels x for subjects with genotype g. Under the naïve Bayes framework, we estimate the probability of the subject possessing each of the three genotype classes, given an observed protein level (Fig. 3b):

$$P(g \mid x) \propto P(g) \cdot P(x \mid g)$$

By repeating this process for each of the N protein/SNP pairs, we obtain the probability of each genotype class for the top 100 SNPs. We calculate the odds of each genotype being the true genotype, and then, using the known genotype values g_1 … g_N for each subject, we can compute the odds of observing the correct or "true" genotype vector g_true for a subject as the product of the odds of observing the individual true genotype values.
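As an illustration of the Gaussian naïve Bayes step described above, the following standard-library Python sketch computes the posterior for one protein/SNP pair and combines per-SNP odds into a single score for a genotype vector. It is not the authors' implementation (which uses scikit-learn's GaussianNB): the function names, the dictionary layout of the parameters, the use of log-odds in place of a raw product, and the clamping constant are all assumptions.

```python
import math


def gaussian_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def genotype_posterior(x, params, priors):
    """P(g | x) for one protein/SNP pair.

    params: {genotype: (mu, sigma)} estimated from training data.
    priors: {genotype: P(g)}, e.g. training genotype frequencies.
    """
    unnormalized = {g: priors[g] * gaussian_pdf(x, *params[g]) for g in params}
    total = sum(unnormalized.values())
    return {g: value / total for g, value in unnormalized.items()}


def log_odds_of_vector(protein_levels, genotype_vector, models, eps=1e-12):
    """Sum of per-SNP log-odds that genotype_vector produced protein_levels.

    Summing logs is equivalent to multiplying odds but avoids underflow.
    models: one (params, priors) tuple per protein/SNP pair.
    eps clamps probabilities away from 0 and 1 (an assumption of this sketch).
    """
    total = 0.0
    for x, g, (params, priors) in zip(protein_levels, genotype_vector, models):
        p = genotype_posterior(x, params, priors)[g]
        p = min(max(p, eps), 1.0 - eps)
        total += math.log(p / (1.0 - p))
    return total
```

A higher score means the candidate genotype vector is a more plausible source of the observed protein levels.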
For each subject with proteome data, we calculate the odds of the genotype vector of every genotyped subject in the dataset. Assuming one of the genotyped subjects within the dataset is the true identity S_true with observed protein levels x_true, we take the genotype with the highest odds given the observed protein values as the "match" for this subject. If the genotype with the highest odds of match (top 1) belongs to the subject whose protein levels were observed, we consider this a match. We also tested whether the true match was among the three highest odds (top 3) and 1% highest odds (in top 1%).
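The ranking and accuracy calculation described above can be sketched as follows. The nested-dictionary score layout, the function names, and the convention that a proteome and its true genome share the same subject ID are assumptions of this example, not details taken from the study code.

```python
import math


def rank_of_true_genome(scores_for_proteome, true_id):
    """1-based rank of the true genome when candidates are sorted by score."""
    ordered = sorted(scores_for_proteome, key=scores_for_proteome.get, reverse=True)
    return ordered.index(true_id) + 1


def top_k_accuracy(scores, k):
    """Fraction of proteomes whose true genome is among the k best scores.

    scores: {proteome_subject_id: {genome_subject_id: odds or log-odds}}
    """
    hits = sum(
        1 for subject_id, row in scores.items()
        if rank_of_true_genome(row, subject_id) <= k
    )
    return hits / len(scores)


def top_percent_accuracy(scores, percent=1.0):
    """Accuracy when the true genome may be anywhere in the top percent."""
    n_genomes = len(next(iter(scores.values())))
    k = max(1, math.ceil(n_genomes * percent / 100.0))
    return top_k_accuracy(scores, k)
```

Calling `top_k_accuracy(scores, 1)` and `top_k_accuracy(scores, 3)` gives the top 1 and top 3 measures, and `top_percent_accuracy(scores)` gives the top 1% measure.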

Associations with smoking.
A t-test was used to assess whether proteins (log transformed) were associated with current smoking (smoking cigarettes in the past 30 days).
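For readers who want to see the test statistic spelled out, here is a minimal sketch of a two-sample t-test on log-transformed protein values. The text does not state which variant of the t-test was used, so the choice of Welch's unequal-variance form is an assumption, as are the function and argument names. The sketch returns the statistic and degrees of freedom only; obtaining a p-value would require a t-distribution CDF from a statistics library.

```python
import math
from statistics import mean, variance


def welch_t(smoker_levels, non_smoker_levels):
    """Welch's t statistic and degrees of freedom on log-transformed values."""
    a = [math.log(v) for v in smoker_levels]
    b = [math.log(v) for v in non_smoker_levels]
    # Squared standard errors of the two group means.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df
```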
Software and packages. All analyses were run in R (version 3.6.11) and Python (version 3.7). The code used in this manuscript is available on GitHub (https://github.com/BowlerLab/reidentify_code).

Results
Model training and parameter optimization. Our first attempts at model training used only COPDGene subjects, most of whom had predominantly European ancestry. This analysis identified 778 proteins with at least one pQTL SNP. To test the accuracy of protein measurements to predict genotypes, every proteome was assigned a probability of matching each genome (Fig. 4). The accuracy of the method was determined by how many times a subject with a proteome had the true genome assigned the highest probability of a match as the first choice, among the top three choices, or within the top 1% of the dataset. This method demonstrated excellent testing accuracy in identifying independent subjects of European ancestry in COPDGene, MESA, and SPIROMICS (83-92%); however, testing accuracy in subjects with predominantly African ancestry was significantly lower (61-76%). Therefore, we retrained our models using additional African ancestry subjects from JHS. In the JHS training data set we identified 372 proteins with at least one pQTL SNP. We then combined the COPDGene and JHS training pQTLs for a total of 591 proteins with at least one pQTL SNP (Supplemental File 1). Using this combined COPDGene and JHS training set, we significantly improved the matching accuracy in African American subjects (Fig. 5) to ~ 90%, which is similar to the accuracy in European ancestry subjects. Next, we sought to determine the minimum number of protein-pQTL pairs that were necessary to match a proteome to a genome. First, we ranked protein-pQTL pairs by p-value and then retested using only smaller subsets of the strongest protein-pQTL pairs (Supplemental Table 1).
Using the 1.3K assay, overall accuracy plateaued at around 100 of the most significant protein-pQTL pairs, but including all nominally significant protein-pQTL pairs led to slightly lower accuracy, suggesting that the lower-significance pairs introduced more noise than signal and that additional protein information is not informative for matching to genomes.
Testing accuracy of matching proteome to genome across diverse, independent cohorts. Using the top 100 protein-pQTL SNPs from the training data (COPDGene and JHS training subjects), we then tested prediction accuracies in 4 cohorts (SPIROMICS, MESA, JHS, COPDGene) using independent subjects that had not been used for training, including accuracies based on race and ethnicity (Table 2). The true match had the highest odds for most subjects (> 85%) in the cohorts and populations, except for COPDGene and Black Americans in MESA. If we took the top 1% of highest odds, the true match was among them for most subjects (> 85%) in all cohorts and populations.
To determine whether newer and larger proteome assays were more or less accurate at identifying genetic profiles, we randomly split 5292 COPDGene subjects (71% NHW and 29% AA) who had SomaScan v4.0 5K data (4776 proteins) into training and testing groups using a 50/50 train-test split (Supplemental Table 2) to generate a new list of protein-pQTL pairs (Supplemental File 2). We also used these novel protein-pQTL pairs to match 11,761 proteomes (8987 NHW and 2774 AA subjects) with 12,219 genomes (9345 NHW and 2874 AA subjects) from the ARIC cohort. With as few as 100 proteins, identification accuracy improved to > 99% (Table 3), and accuracy in subjects with African ancestry was similar to that in subjects with predominantly European ancestry, although it was still slightly higher in European ancestry than in African ancestry subjects (99% versus 98%). Accuracy was similarly > 98% in ARIC, even when > 92% genotype imputation was needed in ARIC. Adding additional protein-pQTL information beyond the top 150 tended to slightly decrease accuracy, most likely due to additional noise.
Table 3. Training and testing accuracy of matching proteome to genome for SomaScan 5K data.
Using the same proteins described above, we show that we can identify individuals even without genetic databases using either the SomaScan 5K (COPDGene) or 7K (SPIROMICS) data. We show this by calculating Euclidean distances in N-dimensional space; this distance is shortest between proteomes from the same subject over years compared to those from unrelated individuals (Supplemental Fig. 1). This demonstrates that a proteome is most closely related to the proteome of the same subject across time. In the JHS cohort there were 314 subjects with proteome profiles and first-degree relatives in the genomic dataset. Among those, 125 (39.8%) had at least 1 sibling in the top 1% of matches and 85 (27.1%) had all siblings in the top 1% of matches (Supplemental Fig. 2). This demonstrates that a proteome can help identify first-degree relatives.
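The proteome-to-proteome comparison described above amounts to a nearest-neighbor search under Euclidean distance. The sketch below illustrates the idea with hypothetical function names and a dictionary of reference profiles. It assumes each profile is a list of the same selected proteins in the same order and on a comparable scale.

```python
import math


def euclidean(profile_a, profile_b):
    """Euclidean distance between two equal-length protein profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))


def nearest_profile(query, reference_profiles):
    """ID of the reference proteome closest to the query proteome.

    reference_profiles: {subject_id: [protein levels]} from another visit.
    """
    return min(
        reference_profiles,
        key=lambda subject_id: euclidean(query, reference_profiles[subject_id]),
    )
```

If the same person's earlier profile is in the reference set, it is expected to be the nearest neighbor, which is the pattern reported in Supplemental Fig. 1.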
Genome privacy protection through proteome transformation. Since we have shown that measurement of selected proteins with strong pQTLs can provide genetic information similar to a SNP, we reasoned that removing the pQTL effects on the proteome would inhibit the ability to reidentify a subject. One method that accomplishes this is to adjust each protein measurement by subtracting the population mean for that genotype (Fig. 6). This method has the advantage in that if the subject's genotype and the correction factors are known, it is simple to recapitulate the actual protein measurements. In both testing cohorts, subtracting the genotype effect abolished the ability to identify subjects (Fig. 7).
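A minimal sketch of this adjustment for a single protein/SNP pair is given below. The function names are hypothetical, and the sketch assumes that the genotype-specific means are estimated from the same population and stored as the "correction factors" needed to reverse the transformation.

```python
from collections import defaultdict


def genotype_means(levels, genotypes):
    """Mean protein level for each genotype class (0, 1 or 2)."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for level, genotype in zip(levels, genotypes):
        sums[genotype] += level
        counts[genotype] += 1
    return {genotype: sums[genotype] / counts[genotype] for genotype in sums}


def remove_genotype_effect(levels, genotypes, means):
    """Subtract the genotype-specific mean from each measurement."""
    return [level - means[g] for level, g in zip(levels, genotypes)]


def restore_genotype_effect(adjusted, genotypes, means):
    """Reverse the adjustment when genotypes and means are known."""
    return [value + means[g] for value, g in zip(adjusted, genotypes)]
```

After the subtraction, every genotype class is centered at zero, so the protein level no longer separates the genotypes, while within-class variation is left untouched.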

Can genotype adjustment preserve biomarker-phenotype associations?
To test if adjusting for genotype affects associations between biomarkers and phenotypes, we first identified two proteins, sICAM-5 and DERM, which were significantly associated with smoking status in both the COPDGene and SPIROMICS testing cohorts. Next, we assessed the association before and after adjustment for genotype. In both cohorts, associations with smoking status did not change significantly after genotype adjustment (Supplemental Table 3). Using logistic elastic net we are also able to demonstrate that using 67 proteins from COPDGene 5K data, one can predict sex with > 99% sensitivity and specificity (Supplemental File 3 and 4). In SPIROMICS subjects we can also use elastic net to identify self-reported African American race and percent genetic African Ancestry (Fig. 8). The correlation between protein ancestry score and genetic ancestry score was 0.98.
Using the matching algorithm to identify mislabeled samples in existing datasets. In all our efforts to match proteomes with genomes, our matching accuracy seemed to plateau around 99.8%, even for the platforms with > 5000 proteins. In nearly all cases in which there was not a correct match of proteome to genome, the proteome had a nearly 100% probability of matching to a different genome. This suggests that either the proteome or the genome had been mislabeled, likely due to a sample swap during the chain of custody from research subject to data generation. We assessed the extent and causes of poor matching by using SomaScan 7K data from SPIROMICS, in which 18 of 5132 (0.2%) proteomes did not exactly match their genome. In 8 of 18 proteomes the subject had multiple visits which generated proteomes, many of which matched to the same genome belonging to a different person, suggesting that the DNA was mislabeled and came from a different person. In 4 of 18 proteomes, all but one of the subject's proteomes matched correctly to the genome and the mismatched proteome had a corresponding mismatched sample from the same visit. This suggests that a plasma sample was swapped between two subjects at a single visit (see examples in Fig. 9). For 6 of 18 subjects who had mismatched genomes and proteomes, there was only one proteome and genome in the database and therefore we could not determine whether it was the proteome or genome that was mislabeled.
Shown are the accuracy of the matching algorithm with (red) and without (blue) removing the mean pQTL effect, as well as the probability of a random guess matching (grey).
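The reasoning used in this section to decide which sample was mislabeled can be summarized as a small decision rule. The sketch below is a simplification written for illustration and is not the authors' procedure: the function name and return strings are hypothetical, and it does not check for the reciprocal mismatch in a second subject that would confirm a plasma swap.

```python
def triage_subject(subject_id, best_matches):
    """Classify a subject from the top-matching genome ID at each visit.

    best_matches: list with one genome ID per proteome (visit), each being
    the genome with the highest odds for that proteome.
    """
    mismatches = [genome for genome in best_matches if genome != subject_id]
    if not mismatches:
        return "consistent"
    if len(best_matches) == 1:
        # One proteome and one genome: cannot tell which one is mislabeled.
        return "undetermined"
    if len(mismatches) == len(best_matches) and len(set(mismatches)) == 1:
        # Every visit points to the same other person's genome.
        return "genome likely mislabeled"
    if len(mismatches) == 1:
        # All visits but one match the subject's own genome.
        return "plasma sample likely swapped at one visit"
    return "unclear"
```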

Figure 8. The proteome can accurately predict the percentage of genetic African Ancestry. In SPIROMICS, pooled genetic ancestry was calculated using genotypes as described (PMCID: PMC6090900). Using SomaScan 7K data we used elastic net to create an Ancestry PC1 (African ancestry) protein score and then used independent subjects to determine the correlation between the percent genetic African ancestry with protein ancestry. The correlation between protein ancestry score and genetic ancestry score was 0.98.

Discussion
De-identification of data is a key concept for shared research and privacy protection but is not yet used in large scale proteomic studies. While small proof of concept studies have suggested that mass spectrometry can identify missense variants (minor allelic peptides) which can suggest specific SNPs 17 , this approach has not yet been used across large scale cohort studies nor with non-mass spectrometry proteomic data. This study is the first to demonstrate on a large scale that proteomic data are not identity protected because an individual proteome can be matched to a specific genome with high accuracy even without protein sequence information. The key identifying features in the proteome are the effects of common pQTLs, which link a measured protein level to a specific genotype. Furthermore, we show that identification only requires a small number of proteins (as few as 60-100 selected proteins) to link an individual protein profile to a single genetic profile among thousands of subjects and that it is accurate even with imputed genotypes. Additionally, our results suggest that using diverse subjects for selecting the most influential proteins improves overall accuracy, particularly among those with African ancestry and underscores the importance of including diverse subjects in Omics research. We show that proteomic data can identify behavioral features (e.g., smoking) even after removing the features that allow matching to genomes. The ability to accurately identify someone by linking their proteome to a genome, identify risk for protein related disease such as alpha-1 antitrypsin deficiency 18 , infer sex, genetic ancestry, or relatedness and also characterize other characteristics such as body fat, renal function, fitness, smoking, alcohol consumption, diabetes, cardiovascular risk 19 , and age 20 implies that proteomic data should have at least the same (if not more rigorous) privacy protections as genetic and genomic datasets. 
Figure 9. How the matching technique can be used to identify mislabeled omics data. (A) Two subjects (1 and 2) were enrolled at the same clinical center at a baseline visit. Their plasma proteomes matched (P = 1) a different subject's genome at baseline from the same clinical center, but their plasma proteomes matched the correct genome at subsequent visits. Another example of this is two subjects (3 and 4) from a different clinical center who appear to have had their plasma samples swapped at their year 1 visit. This suggests that plasma samples were swapped at a single clinical center during a single visit and should be relabeled. (B) A subject (Subject A) who has multiple visits in which the proteomes all mapped consistently to the genome of a different person (Subject B). This suggests that the DNA sample that was used for genotyping was swapped and that the DNA genotype data from Subject A should be labeled as coming from Subject B. Note that the x-axis for all the figures is shown on a log scale because the probability of a proteome matching to an unrelated genome is essentially zero (e.g., P < 10^-40).
The two main technological breakthroughs that have facilitated accurately matching an individual proteome to a specific genome are improvement in high throughput proteomic technologies and large scale pQTL studies. Until the last few years, there were no proteomic platforms that could simultaneously and accurately measure more than 100 proteins and there was little known about which of those proteins had strong pQTLs. While our study used three different SomaScan platforms, lack of privacy (de-identification) should be implied for any platform that can simultaneously measure thousands of proteins even when mass spectrometry is not used. The logical continuation of this principle is that proteomic data could be used to discriminate based on identifying the sex of a subject, ancestry, or paternity.
A protein profile could even be used to identify close relatives for forensic purposes. The ability to link proteomes to genomes is not always a bad thing, particularly when cleaning data. For instance, we used matching to identify when genomes or proteomes are likely to have been mislabeled in large cohort databases. When more than 2 omics data sets are available from subjects, use of multiple pairwise matching can even pin-point which data entry is mislabeled. In our work we demonstrate examples of both plasma and DNA samples that are likely to have been swapped and have proposed corrections to the labeling of data. When used in a judicious manner, this matching technique can give confidence and improve the quality of multi-omic databases.
De-identification and privacy protection by informatics is a growing field. We acknowledge that our proposed privacy-preserving measures are only applicable when Naïve Bayes (NB) is used for profiling and we recognize the large body of emerging literature on alternative data obfuscation methods to protect privacy of many types of data 21 . These methods range from industry level data obfuscation/masking and secure data outsourcing techniques such as substitution, shuffling, numeric variance and null-out/mask-out, to more rigorous statistical data obfuscation methodologies used in Hippocratic Databases 22 , and privacy-preserving data mining 23 methods such as t-Closeness 24 and differential-privacy 25 based methods. Machine learning 26 and deep learning 27 are also being used in proteomic feature identification and we may be able to leverage these same methods to isolate and "cloak" identifiable omics features while maintaining desirable statistical properties of the data for downstream application. We also believe new omics-specific privacy-preserving methods must be introduced to preserve privacy with omics data against model evasion attack methods that can target both traditional profiling models (such as NB) and modern deep learning-based profiling models.
Bioethicists had anticipated that other omics data such as proteomic data might one day be identifiable and create privacy concerns 28 and our work demonstrates that this day has come even for proteomic technologies that do not rely on peptide sequencing. Unfortunately, most governmental policies do not yet apply to newer omics data such as proteomics (one exception may be the General Data Protection Regulation in the European Union, which protects biological equivalents of genotypes). We suggest biomedical research policies be clarified or amended to include any omics data (e.g., measurement of proteins or other molecules, such as metabolites) in which genotype can be ascertained 29 , but also that there be consideration beyond genotype equivalents to include all features of omics (e.g., behavioral information such as smoking). Because data protection is imperfect and frequently breached, a complementary solution to maintaining privacy might include bioinformatic and identity preserving adjustments to proteomic data. We demonstrated that adjusting out the genetic effects on protein measurements protects privacy by obfuscating the genetic effects, but it still does not change non-genetic associations (such as smoking). This strategy is simple and can be reversed, if necessary, when a researcher has the accompanying genetic information. A disadvantage to removing genetic coding of the proteome is that it could remove associations in which genotype mediates protein effect. Another caveat from our work is that if training the method does not include diverse populations, the identification methods may not be generalizable outside European ancestry. While lower identifiability may be beneficial, future privacy protection algorithms may suffer if identifying features in underserved populations are not fully known.